Genetic variations and gene expression profiles of Rice Black-streaked dwarf virus (RBSDV) in different host plants and insect vectors: insights from RNA-Seq analysis

Rice black-streaked dwarf virus (RBSDV) is the etiological agent of a destructive disease affecting several economically important crops of the Gramineae family in Asia. Although RBSDV causes high yield losses, the genetic characteristics of replicative viral populations have not been investigated within different host plants and insect vectors. Herein, eleven publicly available RNA-Seq datasets from Chinese RBSDV-infected rice, maize, and viruliferous planthopper (Laodelphax striatellus) were obtained from the NCBI database. The SNP patterns and RNA expression profiles of the RBSDV populations were analyzed with CLC Workbench 20 and Geneious Prime software. These analyses revealed 2,646 mutations with codon changes across the RBSDV transcriptome and forty-seven co-mutated hotspots with high variant frequency within crucial regions of the S5-1, S5-2, S6, S7-1, S7-2, S9, and S10 open reading frames (ORFs), which are responsible for some virulence and host range functions. Moreover, three joint mutations are located on the three-dimensional structure of the P9-1 protein. The datasets from the infected RBSDV-susceptible rice cultivar KTWYJ3 and from the indigenous planthopper contained more co-mutated hotspots than the others. The expression patterns of viral genomic fragments varied depending on the host type. In host plants, unlike in the planthopper, the S5-1, S2, S6, and S9-1 ORFs had the greatest read numbers, in that order, and S5-2, S9-2, and S7-2 were expressed at the lowest levels. These findings underscore that virus/host combinations influence the genetic variation and gene expression profiles of plant viruses. Our analysis revealed no evidence of recombination events. Interestingly, negative selection was observed in 12 RBSDV ORFs, whereas positive selection was detected at position 1015 in the P1 protein. The research highlights the potential of SRA datasets for analyzing the virus cycle and enhances our understanding of RBSDV's genetic diversity and host specificity.
Supplementary Information The online version contains supplementary material available at 10.1186/s12864-024-10649-9.

Genetic analysis of RBSDV has revealed variations in different genomic segments and selective pressures acting on them. Structural proteins such as P2 and P4 show higher conservation than the non-structural protein P9. The S9 genomic segment exhibits the highest nucleotide diversity, while the S10 segment contains the highest number of conserved regions. Furthermore, a previous report indicated that S5-2 and S2 were under the highest and lowest selective pressure, respectively [19]. Several RBSDV proteins, such as P6, P7-1, P7-2, P9-1, and P10, are involved in viral pathogenesis and interaction with host factors [11, 20-22].
A viral species consists of populations of different mutants, which are called "quasispecies" [23]. RNA viruses, including RBSDV, exhibit high mutation rates, which contribute to genetic diversity and evolution. Recombination, selection pressure, and genetic drift play significant roles in shaping the quasispecies structure of viral populations [24-27]. Understanding the rate and nature of these changes is crucial for developing effective strategies to control viral diseases [28]. Pathogen-host interactions also influence genetic diversity within viral populations as they adapt to different hosts and tissues [29, 30]. Molecular diagnostic methods based on next-generation sequencing (NGS), including RNA sequencing (RNA-Seq) and small RNA sequencing (sRNA-Seq), have provided valuable tools for studying plant viromes [31, 32]. For decades, the identification and diagnosis of plant viruses were limited to protein-based immunological tests such as ELISA or to sequencing nucleotide fragments amplified by polymerase chain reaction (PCR). Because the entire transcriptome of viral populations was not easily accessible, the genetic diversity and evolution of plant viruses remained unknown. Recent advanced tools based on NGS, whole RNA sequencing, and metagenomics have greatly helped in investigating the expression levels and genetic diversity of plant viruses. Analyzing viral transcriptomes from entire populations can unveil a hidden reservoir of mutations within viral communities, representing the final genetic variations before protein expression [33-37]. To this end, we used, for the first time, transcriptomic datasets of RBSDV to reveal the mutations and genetic variation that occurred in these populations and to compare the expression levels of the segments in the plant and insect hosts.
In this study, we aimed to analyze Sequence Read Archive (SRA) datasets from various plant and insect hosts to investigate mutations and genomic variations within RBSDV populations. We examined protein changes resulting from mutations and estimated the frequency of variant forms. Additionally, we explored the expression levels of each genomic fragment within the viral transcriptomes of different hosts.

RNA-Seq datasets from infected hosts
RNA-Seq datasets from infected hosts were acquired for this study. Datasets with low coverage were identified and excluded during analysis. A total of eleven RNA-Seq datasets with good quality and suitable coverage of the virus genome were obtained from the SRA-NCBI database. These datasets were originally derived from Chinese RBSDV-infected rice, maize, and the viruliferous planthopper, L. striatellus, and were generated from 2017 to 2020. The specific sequence datasets used in the current investigation are documented in Table 1. All datasets were generated on Illumina HiSeq 2000-4000 platforms.

Preparing transcriptomic reads
Preparing transcriptomic reads involved several steps to ensure high-quality datasets. Initially, the quality control (QC) of the reads was carefully examined. To optimize the transcriptomic datasets, the reads were trimmed using CLC Genomics Workbench 20 (QIAGEN). This trimming process removed adapters, ambiguous nucleotides, and low-quality regions from the fastq datasets. Default parameters were used in CLC Genomics Workbench: reads shorter than 15 nucleotides, reads with more than 2 ambiguous nucleotides, and bases with a Q-score of <= 5 were considered for trimming. Subsequently, the trimmed reads were mapped to the reference genome of RBSDV (accession numbers NC_003728-NC_003737), which encompasses all ten double-stranded RNA (dsRNA) genomic fragments. This mapping aligned the transcriptomic reads to the corresponding locations in the RBSDV reference genome, facilitating further analysis and interpretation.
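The trimming thresholds described above can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench algorithm, whose internals are not described here; the function name `trim_read`, the simple end-trimming strategy, and the Phred+33 quality encoding are assumptions made only for illustration.

```python
from typing import Optional, Tuple

MIN_LENGTH = 15     # reads shorter than this after trimming are discarded
MAX_AMBIGUOUS = 2   # maximum number of ambiguous (N) bases tolerated per read
MAX_LOW_QUALITY = 5 # bases with a Phred score <= this value count as low quality


def trim_read(seq: str, qual: str, phred_offset: int = 33) -> Optional[Tuple[str, str]]:
    """Trim low-quality bases from both read ends, then apply length and N filters.

    Hypothetical helper: returns the trimmed (sequence, quality) pair,
    or None when the read should be discarded.
    """
    scores = [ord(char) - phred_offset for char in qual]
    start, end = 0, len(seq)
    # Walk inwards from each end while the base quality is at or below the cutoff.
    while start < end and scores[start] <= MAX_LOW_QUALITY:
        start += 1
    while end > start and scores[end - 1] <= MAX_LOW_QUALITY:
        end -= 1
    trimmed_seq, trimmed_qual = seq[start:end], qual[start:end]
    if len(trimmed_seq) < MIN_LENGTH:
        return None
    if trimmed_seq.upper().count("N") > MAX_AMBIGUOUS:
        return None
    return trimmed_seq, trimmed_qual
```

A read that survives this filter would then be passed to the mapping step; adapter removal is deliberately left out of the sketch.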

De novo genome assembly and virus sequence annotation
De novo genome assembly and virus sequence annotation were performed using mapped reads. The reads were collected and used for de novo assembly with default parameters, such as a word size of 20 and a minimum contig length of 200 nt. Subsequently, the obtained contigs were annotated using the open reading frame (ORF) finding tool in Geneious Prime 2022. Furthermore, a new reference genome was generated from our data, enabling the discovery of single nucleotide polymorphisms (SNPs) within each population.

Assessment of genetic diversity
To investigate genetic diversity, the RNA-Seq datasets were analyzed using the virus sequences generated from these datasets as the reference genome. The CLC Genomics Workbench was employed for the analysis. The following thresholds were set: a minimum coverage of 2, a minimum variant frequency of 0.01, a maximum variant p-value of 10^-6, and a minimum strand-bias p-value of 10^-5. The impact of genetic diversity on translational changes was examined with the Geneious Prime software. This analysis included an exploration of polymorphism types, protein effects, variant frequency, coding sequence (CDS) positions, amino acid changes, codon changes, and variant p-values. Furthermore, SNPs within the ORFs of the virus were assessed. To visualize SNPs on the 3D protein structures, structure files were downloaded from the RCSB Protein Data Bank (PDB) (https://www.rcsb.org) and integrated with the CLC Genomics Workbench.
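The four variant-calling thresholds listed above can be read as a simple conjunction of filters. The following Python sketch shows that logic only; the `Variant` record and its field names are hypothetical and do not correspond to the export format of CLC Genomics Workbench or Geneious Prime.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one called variant."""
    segment: str                # genomic segment or ORF label, e.g. "S10"
    position: int               # 1-based position on the reference
    coverage: int               # read depth at the position
    frequency: float            # variant frequency as a fraction (0 to 1)
    p_value: float              # variant p-value
    strand_bias_p_value: float  # strand-bias p-value


def passes_thresholds(
    variant: Variant,
    min_coverage: int = 2,
    min_frequency: float = 0.01,
    max_p_value: float = 1e-6,
    min_strand_bias_p_value: float = 1e-5,
) -> bool:
    """Return True when the variant satisfies all four thresholds from the text."""
    return (
        variant.coverage >= min_coverage
        and variant.frequency >= min_frequency
        and variant.p_value <= max_p_value
        and variant.strand_bias_p_value >= min_strand_bias_p_value
    )


def filter_variants(variants: Iterable[Variant]) -> List[Variant]:
    """Keep only the variants that pass the default thresholds."""
    return [variant for variant in variants if passes_thresholds(variant)]
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the variant count is to each cutoff.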

Analysis of virus gene expression
To assess the gene expression profile of each ORF, the RNA-Seq dataset was aligned to the reference genome using the CLC Workbench software with the default parameters of the RNA-Seq analysis option. These parameters included a length fraction of 80%, a similarity fraction of 80%, and costs of 2 for mismatches, 2 for deletions, and 3 for insertions. The virus reference genome was transformed into a genome track and gene tracks. The mapping results were then used to calculate the transcripts per million (TPM) for each ORF. The read counts were averaged across the replicates that were used. Because the samples come from different studies and may differ in library preparation, such as the use of a poly(A) fraction or total RNA, we normalized all samples using the "Normalize expression value" tool in CLC Genomics Workbench with the "by totals" normalization method. Finally, the gene expression profiles were compared among all genomic fragments and between different hosts.
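For readers unfamiliar with TPM, the standard definition divides each ORF's read count by its length in kilobases and then rescales these rates so that they sum to one million. The Python sketch below applies that general formula; it is not the CLC Workbench implementation, and the function name `tpm` and the dictionary inputs are assumptions for illustration.

```python
from typing import Dict


def tpm(read_counts: Dict[str, float], lengths_nt: Dict[str, int]) -> Dict[str, float]:
    """Compute transcripts per million from read counts and ORF lengths.

    read_counts: reads mapped to each ORF (hypothetical input).
    lengths_nt:  ORF lengths in nucleotides, keyed like read_counts.
    """
    # Reads per kilobase: normalizes for the fact that longer ORFs collect more reads.
    rates = {orf: read_counts[orf] / (lengths_nt[orf] / 1000) for orf in read_counts}
    total = sum(rates.values())
    if total == 0:
        return {orf: 0.0 for orf in rates}
    # Rescale so that the values of one sample sum to one million.
    return {orf: rate / total * 1_000_000 for orf, rate in rates.items()}
```

Because TPM values of one sample always sum to one million, they describe the relative share of each ORF within that sample, which is why an additional between-sample normalization was applied before comparing hosts.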

Recombination analysis of virus fragments from various hosts
This study employed Geneious Prime software to analyze recombination events between RBSDV genetic fragments isolated from three different hosts: rice, maize, and an insect vector. The analysis focused on the coding sequences of the viral genetic fragments.
Table 1 footnotes (fragment): [39]; j, planted in fields where severe RBSDV infection of maize had frequently occurred, collected from Jinan province, China [40]; k, indigenous L. striatellus (Fallen) collected from Haian, China, and reared on RBSDV-infected rice [41].

Selective pressure on viral segments in different hosts
To

Codon usage bias analysis in different hosts
The study also analyzed the codon usage patterns of the viral CDS from each host. Codons are the three-letter sequences in RNA that specify amino acids during protein synthesis. Because multiple codons can code for the same amino acid (synonymous codons), organisms can exhibit a preference for certain codons for specific amino acids. This preference is termed codon usage bias. The coding sequences from all three hosts (rice, maize, and insect vector) were isolated and converted into a FASTA format file. Subsequently, R software (version 9 of RStudio) with specific sequence analysis packages was employed. The coRdon package [42] was used for sequence management and manipulation. The sequences were imported into R using the readSet function from coRdon. To ensure that only coding regions were included, the check_cds function was employed. Codon frequencies were then calculated using the count_codons function, which determines the number of times each codon appears in the datasets. Descriptive statistics (mean, median, standard deviation, and range) were calculated for the codon adaptation index (CAI), the effective number of codons (ENC), GC content, and GC content at synonymous third codon positions (GC3S). Finally, histograms were generated to visualize the distribution of these values across the entire dataset.
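Two of the quantities described above, codon counts and GC content, are simple enough to sketch directly. The Python code below is an illustrative analogue only; it does not reproduce the coRdon R functions used in the study, the function names are hypothetical, and GC3S is computed under the common convention of excluding Met, Trp, and stop codons.

```python
from collections import Counter

SINGLE_CODON_AMINO_ACIDS = {"ATG", "TGG"}  # Met and Trp have no synonymous codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence (RNA input is converted to DNA letters)."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


def gc_content(cds: str) -> float:
    """Fraction of G and C nucleotides in the sequence."""
    cds = cds.upper()
    return (cds.count("G") + cds.count("C")) / len(cds)


def gc3s(counts: Counter) -> float:
    """Fraction of G or C at the third position of synonymous codons."""
    excluded = SINGLE_CODON_AMINO_ACIDS | STOP_CODONS
    # Ignore codons without synonymous alternatives and codons with ambiguous bases.
    usable = {c: n for c, n in counts.items() if c not in excluded and set(c) <= set("ACGT")}
    total = sum(usable.values())
    if total == 0:
        return 0.0
    return sum(n for c, n in usable.items() if c[2] in "GC") / total
```

For example, `gc3s(codon_counts(cds))` can be evaluated for every CDS of a host and the resulting values summarized with the descriptive statistics mentioned in the text.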

Viral genome assembly
The QC scores of the RNA-Seq datasets were checked to ensure that the transcriptomic data were suitable for further analysis. Across the RNA-Seq datasets, the number of short reads ranged from 15 to 99 million (reading depth). The GC content was between 40% and 50%, and all reads in each dataset had the same length (length distribution) (Table 1). Two factors, enriched 5-mers and nucleotide contributions, appeared to be normal [43]. The per-sequence analysis indicated that most datasets contained 0% ambiguous bases. However, the ambiguous-base content was less than 0.08% in the MG1 dataset from L. striatellus, reaching 7% at only a few base positions of the reads, and was 0% in the RBSDV1, 2, and 3 datasets from rice. Furthermore, the QC per-base analysis revealed that all datasets covered the complete length of reads, with a few exceptions at certain nucleotides for the MG1, MG2, and MG3 datasets from L. striatellus, which had 95% coverage (Table 2). The clean reads were then mapped to the RBSDV reference genome. The highest percentage of mapped reads, 7.21%, was obtained from the RBSDV1, 2, and 3 datasets from rice. Approximately 0.01-0.02% of the total bases were mapped to the viral reference genome.
The results showed that the virus isolates in the present study had a genetic organization typical of the reference genome (Fig. 1). The genome of Rice black-streaked dwarf virus consists of ten dsRNA molecules, and each genomic fragment includes one or two ORFs [44, 45]. The virus ORFs encode 13 structural and non-structural proteins, which play roles in pathogenicity, virus replication, and the construction of viroplasms and tubular structures in insect and plant host cells [16] (Fig. 1).

Genetic diversity of RBSDV in different hosts
To obtain more reliable results when investigating polymorphism within replicative virus populations across different hosts, we combined multiple replicate SRA datasets from each host: 6, 2, and 3 datasets from RBSDV-infected rice, RBSDV-infected maize, and the viruliferous planthopper, respectively (Table 1). Subsequently, two crucial factors were evaluated: the frequency of variants and, in each ORF, the conserved or influential regions that affect the pathogenic cycle and host range during the virus cycle. We analyzed mutations that had codon-changing effects, such as substitutions, frameshifts, or deletions of the start codon. We specifically focused on mutants with a high variant frequency and on mutations occurring in genomic regions with critical functions in virus pathogenesis. We identified forty-seven mutated hotspots that co-occurred in numerous datasets with a high variant frequency (HVF) (Table 3). Other mutants located in important regions did not necessarily co-occur in the datasets or exhibited a lower variant frequency (Supplementary 1). A total of 2,646 mutations were associated with codon and protein shifts across the datasets (Supplementary 2).
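The idea of a co-mutated hotspot, a variant found at high frequency in several datasets, can be expressed as a small grouping step. The Python sketch below is illustrative only: the text does not state a numeric cutoff for "high" variant frequency or a minimum number of datasets, so `min_frequency` and `min_datasets` are hypothetical parameters, as are the function name and the input layout.

```python
from collections import defaultdict
from typing import Dict, List


def co_occurring_variants(
    freqs_by_dataset: Dict[str, Dict[str, float]],
    min_frequency: float,
    min_datasets: int,
) -> Dict[str, List[str]]:
    """Map each variant to the datasets in which it reaches min_frequency.

    freqs_by_dataset: dataset name -> {variant label -> variant frequency (0 to 1)}.
    Only variants found in at least min_datasets datasets are returned.
    """
    hits: Dict[str, List[str]] = defaultdict(list)
    for dataset, freqs in freqs_by_dataset.items():
        for variant, frequency in freqs.items():
            if frequency >= min_frequency:
                hits[variant].append(dataset)
    return {
        variant: sorted(datasets)
        for variant, datasets in hits.items()
        if len(datasets) >= min_datasets
    }
```

Grouping the returned datasets by host would then show whether a hotspot is shared between plant hosts and the insect vector or restricted to one of them.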
All of the recognized SNPs with HVFs are localized in the important regions of the S5-1, S5-2, S6, S7-1, S7-2, S9-1, and S10 ORFs [19, 20, 22, 46-49] (Table 3). The results showed that many single nucleotide polymorphisms with substitution protein effects (SPE) occurred in the replicative populations in different hosts with a high variant frequency. Many recognized SNPs were able to change codons and subsequently encode proteins with different variant frequencies. The RBSDV-susceptible rice cultivar KTWYJ3 datasets (D1, D2-1, and D2-2) contained the greatest number of hotspot mutants. In comparison, the datasets from the RBSDV-infected rice cultivar Wuyujing 7 (RBSDV1, RBSDV2, and RBSDV3) contained fewer hotspot mutants. The mutated hotspots were abundant in the datasets from the midgut of indigenous L. striatellus from Haian, China (RB MG1, RB MG2, and RB MG3). In the RBSDV-infected maize (Zea mays B73) datasets (b73 t1 and b73 t2), mutations were most common in the hotspots of proteins P6, P7-2, P7-1, and P10, in that order. Furthermore, a total of thirty-two SNPs were detected in at least two different hosts, and five special SNPs were recognized only in L. striatellus (Table 3). Therefore, an accumulation of mutations in crucial regions of RBSDV was observed in the plant host cultivars and the native Chinese planthopper.
Mutation, recombination, and reassortment are the primary forces driving genetic variation in viruses. RNA viruses and reverse-transcribing (RT) viruses generally exhibit higher mutation rates (10^-6 to 10^-4 substitutions per nucleotide per cell infection) than double-stranded or single-stranded DNA viruses (10^-8 to 10^-6) [50-54]. This elevated mutation rate in RNA and RT viruses stems from the error-prone nature of their RNA-dependent RNA polymerase and RNA-dependent DNA polymerase (retrotranscriptase, RT), which lack proofreading and base excision repair mechanisms [55]. The transfer of newly introduced and indigenous viral species to native cultivars in a new area is one of the management challenges of viral diseases. Tomato yellow leaf curl disease (TYLCD), economically the most important viral disease of tomatoes, has been endemic in the Middle East. TYLCD gradually spread to Jordan and Iran through transmission from native infected hosts into new tomato cultivars [56]. Worse still, TYLCD developed severe pathogenicity in new variants due to the accumulation of mutations, recombination, and reassortment during the dissemination process [57]. Moreover, plant viruses can show genetic diversity in different cultivars of a plant. The transcriptomic analysis of indigenous and introduced potato cultivars revealed genetic diversity in the sequences of the PVM, PVY, PVH, and PVS viruses and a heterogeneous distribution of pathogens between indigenous and introduced cultivars. More interestingly, a higher accumulation of single nucleotide polymorphisms was estimated in the underground tissues of the potatoes [30].
Their amino acid identity with a query cover of 100% was 80.39%, and the expected domains at positions 108 to 126 and 286 to 303 amino acids showed 89% and 78% similarity, respectively. We identified a deletion of five amino acids (= 15 nucleotides) outside the two transmembrane domains in the S7-1 encoded protein (from amino acids 341 to 345) of SRBSDV. In contrast, Zhou et al. previously reported the removal of only 8 nucleotides from the S8 segment of SRBSDV compared to RBSDV [61]. Given the high identity between the two protein sequences and their similar function in host tubule formation, we considered the possibility that these two domains are important in RBSDV and checked for SNPs in these domains (Table 3). In S7-2, the SNPs E293.3KP7, I166.3LP7, R52.3CP7, G5.3SP7, Q288.7RP7, N149.7SP7, and S195.3RP7 occurred with high variant frequency (HVF) in all three hosts. The N-terminal region of ORF S7-2 is important because it interacts with OSGJD2 and ZeaGID2 in plants [21]. The S10 viral genomic fragment in rice, maize, and the viruliferous planthopper exhibited high variant frequency SNPs, specifically C256.3SP10 in the conserved region of TM2, L124.3FP10 in the conserved region of TM1, and S68.7LP10, F108.0LP10, and Y88.7CP10 in the N-terminal region. The RBSDV P10, a major external capsid protein with 558 amino acids and a molecular weight of 60 kDa, demonstrates multifunctionality in its interactions with both viral and host factors during viral infection. This protein is also known as an integral membrane protein that causes stress in the endoplasmic reticulum (ER); consequently, the pronounced protein responses (for example, the activity of an inhibitor of actin polymerization) appear in plants [11, 22]. Previous studies have reported the presence of three conserved transmembrane domains in the S10 encoded protein: TM1 (119 to 137 aa), TM2 (250 to 270 aa), and TM3 (480 to 500 aa) [48]. Additionally, the N-terminal region of the S10 encoded protein, spanning amino acids 1 to 270, plays a crucial role in interacting with amino acids in LSRACK1 of the small brown planthopper, preventing RBSDV accumulation in cells [22]. Other studies showed that mutations in the Pepper mild mottle virus (PMMoV) coat protein can overcome L-gene resistance in pepper [62].
Moreover, for a virus to thrive through horizontal transmission by insect vectors, it needs a smooth two-step process: efficient acquisition by the insect vectors and successful transmission to a new host. If a virus is acquired but struggles to be transmitted further, its spread within the plant population is severely limited [63]. Viruses can also manipulate their plant hosts in various ways. They might induce the production of specific morphological features or alter plant phenotypes, making the plants more attractive to insect vectors [64]. This, in turn, increases the chance of the virus reaching a new host. However, the exact mechanisms by which viruses manipulate insect vector selection and influence their dissemination success are complex [65, 66]. These manipulations have significant evolutionary implications [67]. Insect vector feeding preferences play a crucial role, as they determine the types of hosts the virus encounters. This shapes the virus's evolutionary trajectory, pushing it towards becoming a specialist or a generalist pathogen [7].
To gain a better understanding of the molecular implications of the identified genetic variations for the three-dimensional (3D) structure of proteins, we retrieved structures from the PDB database and mapped the amino acid changes onto the RBSDV 3D proteins. Only mutations related to the P9 protein's 3D structure could be linked to the PDB database. Numerous predicted protein changes were observed in various regions of the P9 protein, but the most significant alterations occurred at positions 168, 84, 20, 32, 151, 295, 104, 137, and 138 aa in the P9-1 regions (Fig. 3; Table 3; Supplementary 3). A previous study demonstrated that these amino acids are involved in the interaction between P9-1 and the viral P6 protein. Yeast two-hybrid assays revealed that even minor changes in amino acids 1-347 of the P9-1 protein can disrupt the P9-1/P6 interaction and hinder replication processes [20].
In summary, the functions of pathogenicity and host range are carried out by the aforementioned encoded proteins P5, P6, P7, P9, and P10, each of which has conserved and effective regions for performing its functions [19, 20, 22, 46-49]. The RBSDV genome segments exhibited many mutations (2,646) with codon changes and different variant frequencies. Forty-seven co-mutated hotspots were identified across the datasets in the important genetic regions of P5-1, P5-2, P6, P7-1, P7-2, P9-1, and P10, representing an extensive genetic resource for future changes of this virus in related functions. Most of the significant hotspots with high frequency were found in the populations of several hosts and datasets, indicating their rapid spread in RBSDV populations and serving as the likely reason for the formation of dominant populations. RBSDV is endemic to East Asian countries, such as China. The disease occurs in intermittent epidemics, making forecasting difficult [16]. Therefore, it is predicted that the mutated genetic resources will be frequently replicated in an epidemic with a massive reproduction rate.
The rice cultivar KTWYJ3 (RBSDV-susceptible) and indigenous L. striatellus datasets had the highest number of hotspots and the highest number of identical hotspots. Additionally, seven hotspots were observed in crucial regions of proteins exclusively in the L. striatellus datasets (including P5-2, overlapping P5-1 and P5-2, P6, P7-1, and P7-2). These findings suggest the potentially high sensitivity of the indigenous L. striatellus to RBSDV and highlight the high genetic diversity in these two types of RBSDV-infected populations: the rice cultivar KTWYJ3 and the Chinese indigenous L. striatellus. Moreover, some significant SNPs were identified on the 3D protein of P9-1 in the RB MG1, RB MG2, RB MG3, D2-2, RBSDV1, RBSDV2, RBSDV3, and b73 t2 datasets.
Although the rate of genetic change within the genome of RNA viruses is 10^4-10^7 times higher than that of their hosts, the virus's ability to cause disease and adapt to the host can be influenced by the surrounding environment [23, 68]. Mutations serve as valuable resources in populations, potentially facilitating virus transmission to new insect vectors or hosts [69, 70].
In comparison, we observed a smaller number of significant mutants in the rice cultivar Wuyujing 7 (RBSDV1, RBSDV2, and RBSDV3 datasets). However, these mutated hotspots also appeared in several other hosts. The presence of pro-viral host factors enables the virus to replicate and spread throughout the entire host [71, 72]. These pro-viral host factors can transform a tolerant host into a susceptible one. Host susceptibility depends on the balance between pro-viral host factors and suppressive responses. Any alteration in this balance leads to varying degrees of sensitivity to the virus [73-76]. Studies suggest that host factors are influenced by competition between host species and impact pathogenicity evolution [77]. For instance, the increase in pathogenicity of the TYLCD virus during its transfer to new tomato cultivars and its spread from the Middle East to the East was attributed to the accumulation of mutations, recombination, and reassortment [56, 57]. Mutations introduce genetic variation, which serves as the raw material for evolution and adaptation [78-81]. Laboratory studies reveal an interplay between plant RNA viruses and host immune mechanisms. Deficiencies in different host defense pathways significantly influence the rate of viral evolution, the types of genetic adaptations that emerge, and even the level of specialization the virus develops [82]. Adapting to specific host defenses is a complex challenge for viruses, as the host's genetic makeup plays a crucial role in shaping the evolutionary arms race [83]. Furthermore, plant populations exhibit a remarkable heterogeneity in their defense responses, ranging from tolerance to susceptibility [84, 85]. This variation in host defenses plays a significant role in shaping the patterns of viral evolution, driving the emergence of various viral strains.

Viral gene expression
The analysis of transcripts revealed distinct gene expression profiles for each dsRNA genomic fragment and ORF in various plant and insect hosts (Fig. 4). In infected rice (SRX8967824-26, SRX2653517, and SRX2730361-62 SRA datasets), the S5-
The average TPM values were calculated to determine the transcription levels of genomic fragments in different hosts. The ORFs S5-1, S2, S9-1, and S6 exhibited the highest transcription levels, with read numbers of 126,689.2, 123,338.8, 104,608.3, and 92,363.8, respectively (Fig. 2d). While the genomic fragments with the highest expression levels were similar in rice and maize, the expression levels of ORFs changed after entering the insect vector. The gene expression profile showed that the S4, S5-2, and S9-2 ORFs in both the RBSDV-susceptible rice cultivar KTWYJ3 and maize datasets, as well as the S5-2 ORF in the rice cultivar Wuyujing 7 datasets, exhibited low expression levels (TPM). In contrast, the viruliferous planthopper datasets showed more uniform expression, with increased expression levels observed for the S5-2, S9-2, and S7-2 ORFs compared to other hosts (Fig. 4a). In both host plants, the highest expression levels were observed for the S5-1, S2, S6, and S9-1 ORFs. In the insect vector, however, the highest expression levels were detected for the S9-2, S9-1, S4, and S2 ORFs. The S2 ORF encodes the major core structural protein [4, 13]. The P5-1, P6, and P9-1 ORFs are responsible for producing viroplasm inclusions involved in RBSDV replication and assembly [86-88]. Although the functions and interactions of RBSDV-encoded proteins remain unclear, our findings suggest that S5-2, S7-2, and S9-2 may play an important role in virus/L. striatellus interactions.

The recombination events and selection pressure
Our analysis detected no recombinants in any fragment of the virus genome. We report, apparently for the first time, that protein P1 has a positive selection site (CDS position 1015) with a p-value of 0.03 and a likelihood ratio test (LRT) value of 5.56. This suggests that one site in protein P1 is under positive selection pressure.
Fig. 4 (a) Expression level of RBSDV genomic segments within SRA-sequences datasets (Table 1). (b) The expression level of RBSDV genomic segments in average data in each host. TPM: Transcript per million. The RNA-Seq dataset was aligned to the reference genome using the CLC Workbench software
Positive selection pressure occurs when mutations are beneficial to the organism and are therefore favored by natural selection. This can lead to the rapid evolution of the gene [89]. The LRT chart shows the LRT values for each site across the genes. A higher LRT value indicates a stronger signal for positive selection. Overall, the analysis suggests that most of the RBSDV genes are under purifying selection pressure in different hosts, which means that mutations are being selected against. However, there is evidence of positive selection pressure at one site in protein P1. More analysis would be needed to determine the specific function of this gene and the role of the positively selected site (Fig. 5j). Previous studies showed that RBSDV displays a lower frequency of recombination events compared to some other viruses [90]. Moreover, the 13 RBSDV ORFs had previously been shown to be under negative selection (Ka/Ks < 1) [19].

Codon usage bias
Based on the analysis, the effective number of codons (ENC) values are similar across the three hosts (L. striatellus, maize, and rice). The ENC values range from 0.20 to 0.40 for all three hosts. This suggests that there is a similar level of codon usage bias within genes across all three hosts (Fig. 5d-f). The codon adaptation index (CAI) values appear to be higher in rice compared to L. striatellus and maize. The CAI values for rice range from 0.60 to 0.80, while the CAI values for the insect vector and maize range from 0.40 to 0.60. A higher CAI value indicates a stronger bias towards codons that are frequently used in highly expressed genes. This suggests that genes in rice may be more codon-optimized for translation than genes in L. striatellus and maize (Fig. 5a-c).
The GC (GC content) and GC3S (GC content at synonymous third positions) values also appear to be higher in rice compared to L. striatellus and maize. The GC and GC3S values for rice range from 0.50 to 0.70, while those for the insect vector and maize range from 0.30 to 0.50. This suggests that rice may have a higher overall GC content and a higher proportion of G and C nucleotides at synonymous third codon positions compared to the insect vector and maize (Fig. 5g-i). Overall, the results suggest that there may be some differences in codon usage bias among the three hosts. Rice appears to have a higher CAI and GC content compared to L. striatellus and maize, suggesting that genes in rice may be more codon-optimized for translation. However, the ENC values are similar across all three hosts, suggesting that there is a similar level of codon usage bias within genes.

Comparison of RSCU in rice, maize and Laodelphax striatellus
Supplementary 4 shows the relative synonymous codon usage (RSCU) values for all genes of the virus isolated from three hosts: rice, maize, and L. striatellus. RSCU is a measure of codon bias in a gene, indicating how frequently a synonymous codon is used compared to the usage expected if all synonymous codons for that amino acid were used equally. An RSCU value of 1 (Supplementary 4) indicates no bias, values greater than 1 indicate a positive bias (codon preferred), and values less than 1 indicate a negative bias (codon disfavored).
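The RSCU definition above corresponds to dividing the observed count of a codon by the mean count of all synonymous codons for the same amino acid. The Python sketch below implements this general formula with the standard genetic code; it is an illustration rather than the procedure used to produce Supplementary 4, and the function name `rscu` and its input format are assumptions.

```python
from collections import defaultdict
from itertools import product
from typing import Dict

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(codon_counts: Dict[str, int]) -> Dict[str, float]:
    """Relative synonymous codon usage for every sense codon of an observed amino acid.

    RSCU = observed codon count / (total count of the synonymous family / family size).
    Amino acids that never occur in the input are omitted from the result.
    """
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid != "*":  # stop codons are excluded
            families[amino_acid].append(codon)

    values: Dict[str, float] = {}
    for codons in families.values():
        total = sum(codon_counts.get(codon, 0) for codon in codons)
        if total == 0:
            continue
        expected = total / len(codons)  # count expected under equal usage
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values
```

With this definition, a two-codon amino acid such as phenylalanine observed as three TTT and one TTC gives RSCU values of 1.5 and 0.5, so TTT would be read as the preferred codon.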

The comparison for some amino acids
Phenylalanine (Phe): All three hosts show a preference for the TTT codon over TTC. Rice and L. striatellus have a stronger bias towards TTT than maize. Leucine (Leu): All three hosts show a preference for the CT codon family (CTA, CTG, CTT) over the TT codon family (TTA, TTG). Rice has the strongest bias towards CT codons, followed by L. striatellus and then maize. Serine (Ser): All three hosts show a preference for the TCT codon over other serine codons (TCC, TCA, TCG). Rice has the strongest bias towards TCT, followed by the insect vector and maize. Arginine (Arg): All three hosts show a preference for the AGA codon over other arginine codons (CGT, CGC, CGA, CGG). L. striatellus exhibits the strongest bias towards AGA, followed by rice and maize.
Overall, the RSCU analysis reveals differences in the codon usage preferences of the virus in rice, maize, and the insect vector. This suggests that the virus might have adapted its codon usage to the specific tRNA pool of each host for efficient translation. Furthermore, RBSDV is under negative (purifying) selection, meaning that mutations that disrupt essential functions are less likely to persist [19,49]. ENC-plot and neutrality-plot analyses of two proteins, P8 and P10, indicated that natural selection plays a major role in shaping the codon usage patterns of RBSDV, and CAI analyses showed a strong correlation between RBSDV and rice rather than the other hosts (maize, wheat, or Laodelphax striatellus) [90]. While negative selection likely acts on most RBSDV fragments, the presence of numerous co-mutated hotspots across diverse populations suggests that these mutations might confer an advantage to the virus. This advantage could explain the high frequency of these mutations in fragments under negative selection, allowing the virus to spread effectively across different host populations.
Negative selection acts to eliminate deleterious mutations in viral proteins. These mutations disrupt essential functions and hinder the virus's ability to replicate and spread. High-frequency mutations in a protein can therefore seem contradictory to negative selection [91]. However, it is important to understand the nature of these mutations. In some cases, high-frequency mutations might represent escape mutations that allow the virus to evade the immune system or to resist antiviral drugs or pesticides. These mutations can be beneficial in specific environments and would be under positive selection [92][93][94][95]. Therefore, the selective pressure acting on high-frequency mutations depends on the specific type of mutation.
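The RSCU values compared above follow the standard definition: for each amino acid, the observed count of a codon is divided by the count expected if all synonymous codons of that amino acid were used equally, so a value above 1 indicates a preferred codon. The following is a minimal sketch of that calculation and is not the software workflow used in this study. The function name `rscu` and the toy sequence are illustrative assumptions, and the standard genetic code is assumed.

```python
from collections import Counter, defaultdict

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def rscu(cds):
    """Return a dict mapping each sense codon to its RSCU value.

    `cds` is an in-frame coding sequence (hypothetical input, DNA or RNA).
    Amino acids that do not occur in the sequence are skipped, and codons
    containing ambiguous bases are ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group synonymous codons by amino acid, excluding stop codons.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


# Toy sequence (not data from this study): three TTT codons and one TTC codon.
toy = "TTTTTTTTTTTC"
print(rscu(toy)["TTT"])  # 1.5
print(rscu(toy)["TTC"])  # 0.5
```

In the toy example, phenylalanine is encoded four times, three times by TTT, so TTT has an RSCU of 3 / (4 / 2) = 1.5, which is the kind of TTT-over-TTC preference described for all three hosts.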
This study aimed to identify the overall pool of mutations present in the transcriptomes of RBSDV populations from diverse hosts, including rice, maize, and insect vectors. Because the datasets were collected in different years, hosts, and regions, some mutations (especially co-mutations) may have become established within these specific viral populations. However, low-frequency mutations also deserve consideration. These mutations, though currently rare, could become more abundant and even dominant under certain environmental pressures. Overall, the presence of co-mutations within 3 years (2017-2020) suggests that the RBSDV population is evolving. The specific implications depend on the type of mutation, the selection pressures, and the virus itself.

Conclusion
RBSDV is a significant threat to major food sources such as rice, maize, and other grain crops worldwide, leading to substantial economic losses. Originating in East Asian countries such as China, the disease causes intermittent epidemics. In this study, we investigated RBSDV transcriptomic populations in datasets from China covering native L. striatellus and several plant hosts (RBSDV-susceptible or normal), focusing on specific genome fragments and encoded proteins (P5-1, P5-2, P6, P7-1, P7-2, P9-1, P10) associated with pathogenicity and host range. By analyzing viral proteins involved in transmission, viroplasm formation, replication, assembly, and interaction with viral and plant factors, we identified forty-seven co-mutated hotspots with high variant frequency (HVF) in crucial regions. Among the RBSDV-infected populations, the RBSDV-susceptible rice cultivar KTWYJ3 and indigenous L. striatellus displayed the highest number of hotspots, with seven unique to L. striatellus. These findings suggest the insect vector's high sensitivity and genetic diversity. Through a comprehensive survey, we discovered 2,646 single-nucleotide polymorphisms (SNPs) with codon changes in the RBSDV whole transcriptome, highlighting numerous mutated hotspots in key proteins. Identical hotspots with high frequencies were prevalent in several RBSDV-infected host populations, indicating the rapid spread of co-mutated hotspots and the formation of dominant populations.
Gene expression analysis revealed distinct patterns between the plant hosts and the insect vector, suggesting correlations between specific genomic fragments and RBSDV activity in L. striatellus. Although many functions and interactions of RBSDV-encoded proteins remain unclear, we propose that P5-2, P7-2, and P9-2 play vital roles in virus/planthopper interactions. Additionally, the corresponding genomic fragments in the planthopper showed higher specificity in hotspot mutations, potentially indicating increased mutational pressure in their crucial domains. Although some hotspots were identified in the most likely critical regions of the P7-1 genomic fragment, further examination with advanced tools is recommended in future studies. Moreover, the influence of host factors on RBSDV evolution warrants deeper examination in future studies. Overall, our study unveils extensive genetic diversity in RBSDV populations, which could lead to changes in the plant host and insect vector types, potentially expanding the host range and driving the evolution of virulence in RBSDV.

Fig. 2 Average variant frequency within different ORFs in three hosts: rice (a), maize (b), and viruliferous planthopper (c). d. Average total Transcripts Per Million (TPM) in RBSDV genomic fragments, calculated using the CLC Workbench software

Fig. 3 The mutations on the three-dimensional protein of P9-1 in amino acids. The reference amino acids (aa) are shown in purple and the mutation positions in green. a. Mutation at aa position 168. b. Mutation at aa position 20. c. Mutation at aa position 84. Analysis was done by connecting to the PDB database

Fig. 5 a-c. Histograms of the Codon Adaptation Index (CAI) in L. striatellus, rice, and maize hosts, respectively. d-f. Histograms of the Effective Number of Codons (ENC) in L. striatellus, rice, and maize hosts, respectively. g-i. Histograms of GC in L. striatellus, rice, and maize hosts, respectively. j. Likelihood Ratio Test (LRT) chart showing positive selection of the P1 protein of RBSDV at CDS position 1015

Table 1
Properties of the RNA-Seq datasets from RBSDV-infected hosts analyzed in the present study

Table 2
Characteristics of mapped and non-mapped reads of RBSDV-infected datasets to the RBSDV reference genome

Fig. 1 Schematic representation of the Rice black-streaked dwarf virus genome. The RBSDV genome consists of ten double-stranded fragments encoding 13 proteins with different functions. Pr: Protein. JA: Jasmonic acid

Table 3
Single-nucleotide polymorphisms (SNPs) among replicative populations of RBSDV in different plant and insect hosts